The following packages are required for the analysis and visualization on this page:
# General data wrangling
library(tidyverse)
library(DT)
library(readxl)
# Visualization
library(plotly)
library(colorRamps)
library(RColorBrewer)
library(colorspace)
library(NeuralNetTools)
library(gridExtra)
library(ggplotify)
library(knitr)
library(kableExtra)
# Modeling
library(glmnet)
library(caret)
library(ranger)
# Ensemble building
library(caretEnsemble)
This page focuses on developing both individual models and model ensembles to predict annual change in CDR-Sum of Boxes based on principal components (PCs) derived from regional tau-PET uptake per year, along with age at baseline and sex. This PCA-transformed dataset was prepared in Data Preparation.
First, load in the data prepared in the previous data preparation phase:
load("../Prepared_Data.RData")
As per standard convention in model development, I will randomly partition this dataset into training and testing subsets, so that the models are trained and evaluated on independent data. I will use the sample function with a fixed seed (127) to split the data into 75% training and 25% testing.
# Set seed for reproducibility of the random train/test split
set.seed(127)
train.index <- sample(nrow(cortex.changes), nrow(cortex.changes)*0.75, replace=F)
# Remove unnecessary identifier info from datasets for modeling
cortex.df <- cortex.changes %>% ungroup() %>% select(-RID, -interval_num)
# Pre-processing will be applied in model training with caret
# Subset training + test data for cortical lobe-averaged tau uptake per year data
cortex.train <- cortex.df[train.index, ]
cortex.test <- cortex.df[-train.index, ]
I will be using the caretEnsemble package to compile four individual regression models into a stacked ensemble. This package enables evaluation of the models individually as well as together in the ensemble.
The first step is to create a caretList of the four regression models I will use:
1) glmnet
2) knn
3) nnet
4) ranger

I will use the trainControl function to specify ten-fold cross-validation with parallel processing.
# trainControl for 10-fold CV and parallel processing
ensemble.control <- trainControl(method="cv", number=10, allowParallel=T)
I’m using RMSE as the metric for tuning model parameters. All predictors will be automatically standardized using the preProcess center + scale feature; the centering and scaling parameters are estimated from the training data and applied to any new data at prediction time.
# Set seed for consistency
set.seed(127)
# Neural net -- linout=T means linear output (i.e. not constrained to be [0,1]),
# try a range of hidden layer sizes and weight decays
my.neuralnet <- caretModelSpec(method="nnet", linout=T, trace=F, tuneGrid = expand.grid(size=c(1, 3, 5, 10, 15), decay = seq(0, 1, by=0.2)))
# k-nearest neighbors: try 20 different values of k
my.knn <- caretModelSpec(method="knn", tuneLength=20)
# random forest: try up to 15 values of the number of features considered at each split and grow 500 trees
my.randomforest <- caretModelSpec(method="ranger", tuneLength=15, num.trees=500, importance="permutation")
# elastic net: try four different values of alpha for ridge/lasso blending and four lambda values for coefficient penalty
my.elasticnet <- caretModelSpec(method="glmnet",tuneGrid=expand.grid(alpha=c(0,0.1,0.6,1), lambda=c(5^-5,5^-3,5^-1,1)))
# Compile individual models into one cohesive model list using caretList
invisible(capture.output(ensemble.models <- caretList(CDRSB ~ .,
data=cortex.train,
trControl=ensemble.control,
metric="RMSE",
preProcess=c("center", "scale"),
tuneList=list(my.neuralnet, my.knn, my.randomforest, my.elasticnet))))
The final chosen parameters for each model can be viewed:
Elastic net
# Print elastic net model summary
ensemble.models$glmnet
# Elastic net cross-validation results
glmnet.alpha <- ensemble.models$glmnet$results$alpha
glmnet.rmse <- ensemble.models$glmnet$results$RMSE
glmnet.mae <- ensemble.models$glmnet$results$MAE
glmnet.lambda <- ensemble.models$glmnet$results$lambda
# Plot the RMSE and MAE in a facet plot using facet_wrap
p.glmnet.cv <- data.frame(alpha=glmnet.alpha, RMSE=glmnet.rmse, MAE=glmnet.mae, lambda=glmnet.lambda) %>%
mutate(alpha=as.character(alpha)) %>%
pivot_longer(cols=c(RMSE, MAE), names_to="Metric", values_to="Value") %>%
mutate(Metric = ifelse(Metric=="MAE", "Mean Absolute Error", "Root-Mean Square Error")) %>%
# One line per value of alpha
ggplot(data=., mapping=aes(x=lambda, y=Value, color=alpha)) +
# Facet the MAE and RMSE in separate plots
facet_wrap(Metric ~ ., scales="free") +
geom_point() +
geom_line(aes(group=alpha)) +
theme_minimal() +
ggtitle("Elastic Net Regression Cross-Validated Results") +
theme(plot.title=element_text(hjust=0.5),
axis.title=element_blank(),
panel.spacing = unit(2, "lines"))
# Convert to interactive plotly
ggplotly(p.glmnet.cv) %>%
layout(yaxis = list(title = "MAE",
titlefont = list(size = 12)),
xaxis = list(title = "lambda",
titlefont = list(size = 12)),
yaxis2 = list(title = "RMSE",
titlefont = list(size = 12)),
xaxis2 = list(title = "lambda",
titlefont = list(size = 12)),
autosize = F, width = 900, height = 400)
# Clear workspace
rm(glmnet.alpha, glmnet.rmse, glmnet.lambda, glmnet.mae, p.glmnet.cv, train.index)
## glmnet
##
## 246 samples
## 8 predictor
##
## Pre-processing: centered (8), scaled (8)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 221, 222, 221, 222, 222, 221, ...
## Resampling results across tuning parameters:
##
## alpha lambda RMSE Rsquared MAE
## 0.0 0.00032 1.094799 0.15609471 0.7012543
## 0.0 0.00800 1.094799 0.15609427 0.7012543
## 0.0 0.20000 1.094401 0.13804665 0.6929446
## 0.0 1.00000 1.089264 0.13299917 0.6888115
## 0.1 0.00032 1.097088 0.15666046 0.7034021
## 0.1 0.00800 1.096919 0.15533476 0.7021739
## 0.1 0.20000 1.095433 0.14082072 0.6910526
## 0.1 1.00000 1.082621 0.09731275 0.6910564
## 0.6 0.00032 1.097076 0.15671643 0.7033811
## 0.6 0.00800 1.096898 0.15299436 0.7006980
## 0.6 0.20000 1.086487 0.10555798 0.6936131
## 0.6 1.00000 1.072871 NaN 0.6855987
## 1.0 0.00032 1.097177 0.15666515 0.7033982
## 1.0 0.00800 1.097156 0.14965997 0.6997010
## 1.0 0.20000 1.073568 0.27522992 0.6859727
## 1.0 1.00000 1.072871 NaN 0.6855987
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were alpha = 0.6 and lambda = 1.
kNN
# Print kNN model summary
ensemble.models$knn
# kNN cross-validation plot
knn.k <- ensemble.models$knn$results$k
knn.rmse <- ensemble.models$knn$results$RMSE
knn.mae <- ensemble.models$knn$results$MAE
# Plot the RMSE and MAE in a facet plot using facet_wrap
p.knn.cv <- data.frame(k=knn.k, RMSE=knn.rmse, MAE=knn.mae) %>%
pivot_longer(cols=c(RMSE, MAE), names_to="Metric", values_to="Value") %>%
mutate(Metric = ifelse(Metric=="MAE", "Mean Absolute Error", "Root-Mean Square Error")) %>%
# One line per value of alpha
ggplot(data=., mapping=aes(x=k, y=Value, color=Metric)) +
# Facet the MAE and RMSE in separate plots
facet_wrap(Metric ~ ., scales="free") +
geom_point() +
geom_line() +
theme_minimal() +
ggtitle("kNN Regression Cross-Validated Results") +
theme(plot.title=element_text(hjust=0.5),
axis.title=element_blank(),
panel.spacing = unit(2, "lines"))
# Convert to interactive plotly
ggplotly(p.knn.cv) %>%
layout(yaxis = list(title = "MAE",
titlefont = list(size = 12)),
xaxis = list(title = "k",
titlefont = list(size = 12)),
yaxis2 = list(title = "RMSE",
titlefont = list(size = 12)),
xaxis2 = list(title = "k",
titlefont = list(size = 12)),
autosize = F, width = 900, height = 400)
rm(knn.rmse, knn.k, p.knn.cv, knn.mae)
## k-Nearest Neighbors
##
## 246 samples
## 8 predictor
##
## Pre-processing: centered (8), scaled (8)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 221, 222, 221, 222, 222, 221, ...
## Resampling results across tuning parameters:
##
## k RMSE Rsquared MAE
## 5 1.107192 0.07302604 0.6501571
## 7 1.103811 0.05925652 0.6537839
## 9 1.092773 0.03955904 0.6489966
## 11 1.080964 0.04281474 0.6400736
## 13 1.063219 0.05810530 0.6249247
## 15 1.054257 0.06719400 0.6222031
## 17 1.063972 0.05253539 0.6255530
## 19 1.067825 0.04350845 0.6238405
## 21 1.066776 0.04900117 0.6226638
## 23 1.073400 0.04351178 0.6239306
## 25 1.070993 0.04368831 0.6199548
## 27 1.071035 0.04792044 0.6177739
## 29 1.069877 0.05075853 0.6172153
## 31 1.069343 0.04840288 0.6143285
## 33 1.077242 0.04448998 0.6180380
## 35 1.076238 0.04480808 0.6152146
## 37 1.078176 0.04614663 0.6150701
## 39 1.083048 0.03764211 0.6205318
## 41 1.081515 0.03624130 0.6188928
## 43 1.079126 0.03179071 0.6168043
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 15.
Neural Network
# Print neural network model summary
ensemble.models$nnet
# Neural network cross-validation plot
n.neurons <- ensemble.models$nnet$results$size
nnet.rmse <- ensemble.models$nnet$results$RMSE
nnet.mae <- ensemble.models$nnet$results$MAE
nnet.weight <- ensemble.models$nnet$results$decay
# Plot the RMSE and MAE in a facet plot using facet_wrap
p.nnet.cv <- data.frame(n.neurons, RMSE=nnet.rmse, MAE=nnet.mae, decay=nnet.weight) %>%
mutate(decay=as.character(decay)) %>%
pivot_longer(cols=c(RMSE, MAE), names_to="Metric", values_to="Value") %>%
mutate(Metric = ifelse(Metric=="MAE", "Mean Absolute Error", "Root-Mean Square Error")) %>%
# One line per value of alpha
ggplot(data=., mapping=aes(x=n.neurons, y=Value, color=decay)) +
# Facet the MAE and RMSE in separate plots
facet_wrap(Metric ~ ., scales="free") +
geom_point() +
geom_line(aes(group=decay)) +
theme_minimal() +
ggtitle("Neural Network Regression Cross-Validated Results") +
theme(plot.title=element_text(hjust=0.5),
axis.title=element_blank(),
panel.spacing = unit(2, "lines"))
# Convert to interactive plotly
ggplotly(p.nnet.cv) %>%
layout(yaxis = list(title = "MAE",
titlefont = list(size = 12)),
xaxis = list(title = "# Neurons in Hidden Layer",
titlefont = list(size = 12)),
yaxis2 = list(title = "RMSE",
titlefont = list(size = 12)),
xaxis2 = list(title = "# Neurons in Hidden Layer",
titlefont = list(size = 12)),
autosize = F, width = 900, height = 400)
rm(n.neurons, nnet.rmse, nnet.mae, nnet.weight, p.nnet.cv)
## Neural Network
##
## 246 samples
## 8 predictor
##
## Pre-processing: centered (8), scaled (8)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 221, 222, 221, 222, 222, 221, ...
## Resampling results across tuning parameters:
##
## size decay RMSE Rsquared MAE
## 1 0.0 1.245690 0.15166462 0.7053475
## 1 0.2 1.198075 0.09028417 0.7142961
## 1 0.4 1.169390 0.10492935 0.7057446
## 1 0.6 1.141485 0.12019571 0.7016722
## 1 0.8 1.110522 0.16281964 0.6976585
## 1 1.0 1.107683 0.16027783 0.6970955
## 3 0.0 1.712916 0.13281912 0.9139845
## 3 0.2 1.251255 0.03420884 0.7894735
## 3 0.4 1.198274 0.10061319 0.7599557
## 3 0.6 1.159946 0.06386280 0.7331481
## 3 0.8 1.188469 0.06028438 0.7321538
## 3 1.0 1.133342 0.07567047 0.6947239
## 5 0.0 10.468680 0.10319745 2.7156369
## 5 0.2 1.447784 0.09856635 0.8949424
## 5 0.4 1.365099 0.09001070 0.8732478
## 5 0.6 1.218252 0.06958312 0.7633906
## 5 0.8 1.196107 0.10006494 0.7481264
## 5 1.0 1.182999 0.10469178 0.7279717
## 10 0.0 1.998661 0.07082981 1.2652291
## 10 0.2 1.453192 0.06660054 0.9800577
## 10 0.4 1.310242 0.03110645 0.8858864
## 10 0.6 1.255780 0.09319997 0.8084398
## 10 0.8 1.234862 0.07430756 0.7592734
## 10 1.0 1.187241 0.11712181 0.7243907
## 15 0.0 1.741426 0.05719791 1.2122015
## 15 0.2 1.474091 0.07627155 1.0129055
## 15 0.4 1.371093 0.03945652 0.9030756
## 15 0.6 1.305277 0.08997026 0.8305314
## 15 0.8 1.227082 0.12982494 0.7615322
## 15 1.0 1.171387 0.08693675 0.7287818
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were size = 1 and decay = 1.
The optimal neural network includes a single neuron in the hidden layer, which receives input from all 8 input nodes (the tau-PET PCs, baseline age, and sex) and outputs onto the final prediction node. Here’s a graphical representation of this caret-trained neural network:
par(mar = numeric(4))
plotnet(ensemble.models$nnet, cex_val=0.8, pad_x=0.6, pos_col="firebrick3", neg_col="dodgerblue4",
circle_col="lightslategray", bord_col="lightslategray", alpha_val=0.4)
Random Forest
# Print random forest model summary
ensemble.models$ranger
# Random forest cross-validation plot
splitrule <- ensemble.models$ranger$results$splitrule
numpred <- ensemble.models$ranger$results$mtry
rf.rmse <- ensemble.models$ranger$results$RMSE
rf.mae <- ensemble.models$ranger$results$MAE
# Plot the RMSE and MAE in a facet plot using facet_wrap
p.rf.cv <- data.frame(splitrule, RMSE=rf.rmse, MAE=rf.mae, numpred) %>%
pivot_longer(cols=c(RMSE, MAE), names_to="Metric", values_to="Value") %>%
mutate(Metric = ifelse(Metric=="MAE", "Mean Absolute Error", "Root-Mean Square Error")) %>%
# One line per value of alpha
ggplot(data=., mapping=aes(x=numpred, y=Value, color=splitrule)) +
# Facet the MAE and RMSE in separate plots
facet_wrap(Metric ~ ., scales="free") +
geom_point() +
geom_line(aes(group=splitrule)) +
theme_minimal() +
ggtitle("Random Forest Regression Cross-Validated Results") +
theme(plot.title=element_text(hjust=0.5),
axis.title=element_blank(),
panel.spacing = unit(2, "lines"))
# Convert to interactive plotly
ggplotly(p.rf.cv) %>%
layout(yaxis = list(title = "MAE",
titlefont = list(size = 12)),
xaxis = list(title = "# Predictors in Decision Node",
titlefont = list(size = 12)),
yaxis2 = list(title = "RMSE",
titlefont = list(size = 12)),
xaxis2 = list(title = "# Predictors in Decision Node",
titlefont = list(size = 12)),
autosize = F, width = 900, height = 400)
rm(splitrule, numpred, rf.rmse, rf.mae, p.rf.cv)
## Random Forest
##
## 246 samples
## 8 predictor
##
## Pre-processing: centered (8), scaled (8)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 221, 222, 221, 222, 222, 221, ...
## Resampling results across tuning parameters:
##
## mtry splitrule RMSE Rsquared MAE
## 2 variance 1.133259 0.08268213 0.7057911
## 2 extratrees 1.079462 0.06854085 0.6725601
## 3 variance 1.171525 0.09425731 0.7236953
## 3 extratrees 1.098156 0.06807368 0.6828059
## 4 variance 1.182269 0.08854006 0.7337961
## 4 extratrees 1.098247 0.07611410 0.6831745
## 5 variance 1.186409 0.08702435 0.7325433
## 5 extratrees 1.106771 0.08655482 0.6916539
## 6 variance 1.203915 0.08312578 0.7374184
## 6 extratrees 1.105800 0.09635742 0.6947309
## 7 variance 1.205372 0.08462786 0.7381463
## 7 extratrees 1.113361 0.08141956 0.6982263
## 8 variance 1.219286 0.08633762 0.7451829
## 8 extratrees 1.112603 0.10405555 0.6954723
##
## Tuning parameter 'min.node.size' was held constant at a value of 5
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were mtry = 2, splitrule = extratrees and min.node.size = 5.
These four models can be resampled and aggregated using the resamples function:
# Resample the performance of this ensemble and report summary metrics for MAE, RMSE, and R2
set.seed(127)
ensemble.results <- resamples(ensemble.models)
summary(ensemble.results)
##
## Call:
## summary.resamples(object = ensemble.results)
##
## Models: nnet, knn, ranger, glmnet
## Number of resamples: 10
##
## MAE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## nnet 0.4559073 0.5362215 0.7502167 0.6970955 0.7991874 0.9266854 0
## knn 0.3657926 0.4400819 0.6808184 0.6222031 0.7410961 0.9337061 0
## ranger 0.4617867 0.5156474 0.7147103 0.6725601 0.7819681 0.9799226 0
## glmnet 0.4474888 0.5692098 0.7479911 0.6855987 0.7901979 0.9026879 0
##
## RMSE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## nnet 0.5445679 0.7521224 1.173401 1.107683 1.308853 1.829196 0
## knn 0.4652121 0.6544561 1.088401 1.054257 1.401174 1.783437 0
## ranger 0.5581244 0.7150456 1.118856 1.079462 1.353021 1.865138 0
## glmnet 0.5007390 0.7217365 1.101043 1.072871 1.375093 1.818658 0
##
## Rsquared
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## nnet 0.0011153793 0.008785862 0.17664760 0.16027783 0.27709588 0.3569507 0
## knn 0.0006141381 0.001677300 0.01897170 0.06719400 0.10810592 0.2946562 0
## ranger 0.0014000337 0.014789597 0.05079104 0.06854085 0.08740914 0.2780762 0
## glmnet NA NA NA NaN NA NA 10
Looking at the mean RMSE across resampling iterations, the kNN model has the lowest RMSE; by contrast, the neural network has the highest. The summary reports the \(R^2\) values as well, though since most of these models are non-linear regressions (elastic net aside), \(R^2\) isn’t a valid comparison metric in this instance. Note also the NA \(R^2\) values for glmnet: with alpha = 0.6 and lambda = 1, the penalty likely shrinks all coefficients to zero, producing constant predictions for which \(R^2\) is undefined.
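As a concrete reminder of what these two resampled metrics measure, here is a minimal base-R sketch computing RMSE and MAE by hand on toy numbers (illustrative values, not outputs from the models above):

```r
# Toy observed and predicted values (illustrative only)
actual    <- c(0.5, 1.2, 0.0, 2.3, 1.1)
predicted <- c(0.7, 1.0, 0.4, 1.8, 1.2)

# Root-mean square error: penalizes large errors quadratically
rmse <- sqrt(mean((actual - predicted)^2))

# Mean absolute error: weights all errors linearly
mae <- mean(abs(actual - predicted))

round(c(RMSE = rmse, MAE = mae), 4)  # RMSE ~ 0.3162, MAE = 0.28
```

Because of the quadratic penalty, RMSE is always at least as large as MAE and is pulled up most by the single worst prediction.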
It’s also useful to look at the error correlations between these component models:
cat("\nRoot-Mean Square Error Correlation\n")
# Calculate model RMSE correlations
modelCor(ensemble.results, metric="RMSE")
cat("\n\nMean Absolute Error Correlation\n")
# Calculate model MAE correlations
modelCor(ensemble.results, metric="MAE")
##
## Root-Mean Square Error Correlation
## nnet knn ranger glmnet
## nnet 1.0000000 0.9466783 0.9745663 0.9587457
## knn 0.9466783 1.0000000 0.9873108 0.9923165
## ranger 0.9745663 0.9873108 1.0000000 0.9911293
## glmnet 0.9587457 0.9923165 0.9911293 1.0000000
##
##
## Mean Absolute Error Correlation
## nnet knn ranger glmnet
## nnet 1.0000000 0.9258331 0.9281142 0.9514522
## knn 0.9258331 1.0000000 0.9806123 0.9811033
## ranger 0.9281142 0.9806123 1.0000000 0.9678099
## glmnet 0.9514522 0.9811033 0.9678099 1.0000000
kNN, the random forest, and the elastic net all show pairwise RMSE correlations above 0.98, and the neural network’s correlations with the other models are only slightly lower. This is not ideal: ensemble models are designed to counterbalance individual model errors, which requires those errors to be at least partially decorrelated. The error can be visualized with confidence intervals using the dotplot function:
# Plot the four models' RMSE values
p.rmse <- as.ggplot(dotplot(ensemble.results, metric="RMSE"))
# Plot the four models' MAE values
p.mae <- as.ggplot(dotplot(ensemble.results, metric="MAE"))
# Combine plots side-by-side
grid.arrange(p.rmse, p.mae, ncol=2)
These four models show comparable RMSE confidence intervals. The kNN model shows the lowest estimated RMSE and MAE, while the neural network shows the highest of both. Despite the large error correlations between the models, I’ll move forward to see whether ensembling strengthens the overall predictions for the change in CDR-Sum of Boxes over time.
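The reason error correlation matters can be demonstrated directly: averaging two models with independent errors shrinks the expected RMSE by roughly a factor of \(1/\sqrt{2}\), while averaging models with perfectly correlated errors gains nothing. A small simulation on synthetic data (not the tau-PET models) illustrates this:

```r
set.seed(1)
truth <- rnorm(1000)

# Two models whose errors are independent of each other
pred.a <- truth + rnorm(1000, sd = 1)
pred.b <- truth + rnorm(1000, sd = 1)

rmse <- function(pred, obs) sqrt(mean((pred - obs)^2))

# Each model alone has RMSE near 1; their average is near 1/sqrt(2) ~ 0.71
rmse(pred.a, truth)
rmse(pred.b, truth)
rmse((pred.a + pred.b) / 2, truth)

# A third model with errors identical to pred.a's: averaging buys nothing
pred.c <- truth + (pred.a - truth)
rmse((pred.a + pred.c) / 2, truth)  # equal to rmse(pred.a, truth)
```

Since the error correlations here exceed 0.9, the ensemble can be expected to gain far less than the idealized \(1/\sqrt{2}\) factor.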
Starting with the basic caretEnsemble function, which by default employs a generalized linear model to combine the component models:
set.seed(127)
# Set trainControl --> 10-fold CV with parallel processing
ensemble.control <- trainControl(method="repeatedcv", number=10, allowParallel = T)
# Create stacked ensemble, optimizing for RMSE
stacked.ensemble.glm <- caretEnsemble(ensemble.models, metric="RMSE", trControl=ensemble.control)
summary(stacked.ensemble.glm)
## The following models were ensembled: nnet, knn, ranger, glmnet
## They were weighted:
## 2.9362 0.1623 0.9061 -0.2833 -7.7576
## The resulting RMSE is: 1.105
## The fit for each individual model on the RMSE is:
## method RMSE RMSESD
## nnet 1.107683 0.4105291
## knn 1.054257 0.4469817
## ranger 1.079462 0.4234083
## glmnet 1.072871 0.4254785
The kNN model shows the lowest cross-validated RMSE of the four component models, and the ensemble itself (RMSE = 1.105) does not improve on it. caretEnsemble provides an autoplot method to create in-depth diagnostic plots for ensemble models:
autoplot(stacked.ensemble.glm)
The top left graph shows the mean cross-validated RMSE per model, with the bars denoting RMSE standard deviation. The kNN model shows the lowest RMSE of the four, and its RMSE is even lower than that of the full ensemble model (indicated by the red dashed line). The middle left plot shows the relative weight given to each model in the linear combination; reading the printed weights above (intercept first, then nnet, knn, ranger, glmnet), kNN carries the largest positive weight, the neural network a small positive weight, and ranger and the elastic net negative weights, the elastic net's being large in magnitude.
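Mechanically, the glm stack is just a linear model fit on the component models' cross-validated predictions: the ensemble prediction is an intercept plus a weighted sum of the component predictions. A hand-rolled sketch with made-up predictions and weights (the numbers are illustrative, not the fitted ensemble weights above):

```r
# Component model predictions for three hypothetical observations
component.preds <- data.frame(
  nnet   = c(0.8, 1.1, 0.3),
  knn    = c(0.7, 1.3, 0.2),
  ranger = c(0.9, 1.0, 0.4),
  glmnet = c(0.6, 1.2, 0.3)
)

# Illustrative stacking weights (intercept first, as in the glm summary)
weights <- c(intercept = 0.1, nnet = 0.2, knn = 0.3, ranger = 0.3, glmnet = 0.2)

# Ensemble prediction = intercept + weighted sum of component predictions
stacked.pred <- drop(weights["intercept"] +
                       as.matrix(component.preds) %*% weights[-1])
stacked.pred  # 0.86, 1.25, 0.40
```

Negative weights are possible because the meta-model is an unconstrained regression; a model can be used as a correction term rather than a direct contributor.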
Stacked ensembles can also be constructed using caretStack, which fits a user-specified meta-model on top of the constituent models' predictions rather than a simple linear combination. I'll try one ensemble using a random forest meta-model and one using an elastic net meta-model.
set.seed(127)
# Create random forest-combined stacked ensemble
stacked.ensemble.rf <- caretStack(ensemble.models, method = "rf", metric = "RMSE", trControl = ensemble.control)
# Create elastic net-combined stacked ensemble
stacked.ensemble.glmnet <- caretStack(ensemble.models, method="glmnet", metric="RMSE", trControl = ensemble.control)
The summary statistics for each model can be displayed:
cat("\nStacked ensemble, generalized linear model:\n")
stacked.ensemble.glm
cat("\n\nStacked ensemble, random forest:\n")
stacked.ensemble.rf
cat("\n\nStacked ensemble, elastic net:\n")
stacked.ensemble.glmnet
##
## Stacked ensemble, generalized linear model:
## A glm ensemble of 4 base models: nnet, knn, ranger, glmnet
##
## Ensemble results:
## Generalized Linear Model
##
## 246 samples
## 4 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times)
## Summary of sample sizes: 222, 221, 222, 222, 221, 220, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 1.105009 0.05202703 0.6987817
##
##
##
## Stacked ensemble, random forest:
## A rf ensemble of 4 base models: nnet, knn, ranger, glmnet
##
## Ensemble results:
## Random Forest
##
## 246 samples
## 4 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times)
## Summary of sample sizes: 222, 221, 222, 222, 221, 220, ...
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared MAE
## 2 1.176735 0.04518570 0.7354206
## 3 1.186757 0.06273812 0.7392686
## 4 1.186169 0.07393237 0.7356158
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 2.
##
##
## Stacked ensemble, elastic net:
## A glmnet ensemble of 4 base models: nnet, knn, ranger, glmnet
##
## Ensemble results:
## glmnet
##
## 246 samples
## 4 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times)
## Summary of sample sizes: 222, 221, 220, 222, 221, 221, ...
## Resampling results across tuning parameters:
##
## alpha lambda RMSE Rsquared MAE
## 0.10 0.0003759611 1.081113 0.05213553 0.6930661
## 0.10 0.0037596112 1.080933 0.05217147 0.6928954
## 0.10 0.0375961122 1.077394 0.05382963 0.6895900
## 0.55 0.0003759611 1.081104 0.05217397 0.6930713
## 0.55 0.0037596112 1.080092 0.05285595 0.6920865
## 0.55 0.0375961122 1.071503 0.05622865 0.6847089
## 1.00 0.0003759611 1.081109 0.05217428 0.6930760
## 1.00 0.0037596112 1.079285 0.05350495 0.6913866
## 1.00 0.0375961122 1.068036 0.05924443 0.6840631
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were alpha = 1 and lambda = 0.03759611.
These three stacked ensembles show fairly comparable RMSE and MAE values. The RMSE is slightly lower for the glmnet-combined stacked ensemble, though this difference may not be significant and may not translate to out-of-sample data.
These three ensemble models (glm-, rf-, and glmnet-combined) and the four component models will be used to predict annual change in CDR-Sum of Boxes in the held-out test dataset. Note: since the models were trained with center and scale preprocessing specified, the test data does not need to be manually pre-processed.
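To make the note above concrete: with center + scale preprocessing, caret estimates the centering and scaling parameters on the training data only and re-applies those same parameters to new data at prediction time. A base-R sketch of the equivalent operation (toy vectors, not the actual predictors):

```r
train.x <- c(2, 4, 6, 8)
test.x  <- c(3, 5)

# Estimate the parameters on the TRAINING data only
mu    <- mean(train.x)  # 5
sigma <- sd(train.x)    # ~2.58

# Apply the same parameters to both subsets; the test data never
# contributes to the estimates, so there is no information leakage
train.scaled <- (train.x - mu) / sigma
test.scaled  <- (test.x - mu) / sigma
```

Scaling the test data with its own mean and standard deviation instead would leak test-set information into the features and make the subsets incomparable.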
Model predictions using training data:
# Predict based on training data for the four individual models
# And the three stacked ensemble models
glmnet.train <- predict.train(ensemble.models$glmnet)
knn.train <- predict.train(ensemble.models$knn)
nnet.train <- predict.train(ensemble.models$nnet)
rf.train <- predict(ensemble.models$ranger)
ensemble.glm.train <- predict(stacked.ensemble.glm)
ensemble.glmnet.train <- predict(stacked.ensemble.glmnet)
ensemble.rf.train <- predict(stacked.ensemble.rf)
real.train <- cortex.train$CDRSB
# Combine these predictions into a dataframe for easier viewing
train.df <- do.call(cbind, list(elastic.net=glmnet.train, knn=knn.train, neural.net=nnet.train, random.forest=rf.train,
ensemble.glm=ensemble.glm.train, ensemble.glmnet=ensemble.glmnet.train,
ensemble.rf=ensemble.rf.train, real.CDR=real.train)) %>% as.data.frame()
datatable(train.df %>% mutate_if(is.numeric, function(x) round(x,4)) %>% select(real.CDR, elastic.net:ensemble.rf))
Model predictions using test data:
# Predict based on UNSEEN test data for the four individual models
# And the three stacked ensemble models
real.test <- cortex.test$CDRSB
glmnet.test <- predict.train(ensemble.models$glmnet, newdata=cortex.test)
knn.test <- predict.train(ensemble.models$knn, newdata=cortex.test)
nnet.test <- predict.train(ensemble.models$nnet, newdata=cortex.test)
rf.test <- predict.train(ensemble.models$ranger, newdata=cortex.test)
ensemble.glm.test <- predict(stacked.ensemble.glm, newdata=cortex.test)
ensemble.glmnet.test <- predict(stacked.ensemble.glmnet, newdata=cortex.test)
ensemble.rf.test <- predict(stacked.ensemble.rf, newdata=cortex.test)
# Combine these predictions into a dataframe for easier viewing
test.df <- do.call(cbind, list(real.CDR=real.test, elastic.net=glmnet.test, knn=knn.test, neural.net=nnet.test, random.forest=rf.test,
ensemble.glm=ensemble.glm.test, ensemble.glmnet=ensemble.glmnet.test, ensemble.rf=ensemble.rf.test)) %>% as.data.frame()
datatable(test.df %>% mutate_if(is.numeric, function(x) round(x,4)))
I want to compare these three ensemble models in terms of how their predictions relate to the actual CDR-SoB values in the training and testing data. I also want to compare these results with those obtained from the individual component models to see whether constructing the ensemble confers a predictive advantage. To quantify the association between real and model-predicted CDR-SoB values, I will use the \(R^2\) and RMSE values computed through the R2 and RMSE functions from caret:
# Calculate the RMSE between real vs. predicted CDR-SoB values for training data
# Combine into dataframe for easier viewing
rmse.train <- data.frame(ensemble.glmnet=RMSE(ensemble.glmnet.train, real.train),
ensemble.glm=RMSE(ensemble.glm.train, real.train),
ensemble.rf=RMSE(ensemble.rf.train, real.train),
elastic.net=RMSE(glmnet.train, real.train),
knn=RMSE(knn.train, real.train),
neural.net=RMSE(nnet.train, real.train),
random.forest=RMSE(rf.train, real.train),
Metric="Train_RMSE")
# Calculate the RMSE between real vs. predicted CDR-SoB values for unseen test data
# Combine into dataframe for easier viewing
rmse.test <- data.frame(ensemble.glmnet=RMSE(ensemble.glmnet.test, real.test),
ensemble.glm=RMSE(ensemble.glm.test, real.test),
ensemble.rf=RMSE(ensemble.rf.test, real.test),
elastic.net=RMSE(glmnet.test, real.test),
knn=RMSE(knn.test, real.test),
neural.net=RMSE(nnet.test, real.test),
random.forest=RMSE(rf.test, real.test),
Metric="Test_RMSE")
# Calculate the R-squared between real vs. predicted CDR-SoB values for training data
# Combine into dataframe for easier viewing
r2.train <- data.frame(ensemble.glmnet=R2(ensemble.glmnet.train, real.train),
ensemble.glm=R2(ensemble.glm.train, real.train),
ensemble.rf=R2(ensemble.rf.train, real.train),
elastic.net=R2(glmnet.train, real.train),
knn=R2(knn.train, real.train),
neural.net=R2(nnet.train, real.train),
random.forest=R2(rf.train, real.train),
Metric="Train_R2")
# Calculate the R-squared between real vs. predicted CDR-SoB values for unseen test data
# Combine into dataframe for easier viewing
r2.test <- data.frame(ensemble.glmnet=R2(ensemble.glmnet.test, real.test),
ensemble.glm=R2(ensemble.glm.test, real.test),
ensemble.rf=R2(ensemble.rf.test, real.test),
elastic.net=R2(glmnet.test, real.test),
knn=R2(knn.test, real.test),
neural.net=R2(nnet.test, real.test),
random.forest=R2(rf.test, real.test),
Metric="Test_R2")
# Combine all four prediction dataframes into one table to compare and contrast
# RMSE and R-squared across models
do.call(plyr::rbind.fill, list(rmse.train, rmse.test, r2.train, r2.test)) %>%
pivot_longer(cols=c(-Metric), names_to="Model", values_to="Value") %>%
mutate(Metric = str_replace(Metric, "_", " ")) %>%
pivot_wider(id_cols="Model", names_from="Metric", values_from="Value") %>%
mutate_if(is.numeric, function(x) round(x,4)) %>%
kable(., booktabs=T) %>% kable_styling(full_width=F)
| Model | Train RMSE | Test RMSE | Train R2 | Test R2 |
|---|---|---|---|---|
| ensemble.glmnet | 1.0955 | 0.9042 | 0.1413 | 0.0007 |
| ensemble.glm | 1.1884 | 0.9151 | 0.0290 | 0.0000 |
| ensemble.rf | 1.1162 | 0.9163 | 0.0523 | 0.0038 |
| elastic.net | 1.1418 | 0.8978 | NA | NA |
| knn | 1.0789 | 0.9357 | 0.1413 | 0.0007 |
| neural.net | 1.0769 | 0.8896 | 0.1235 | 0.0424 |
| random.forest | 0.6912 | 0.8785 | 0.8890 | 0.0481 |
I will also use scatterplot visualizations for comparison:
# Create training data ggplot to be converted to interactive plotly plot
p.train <- train.df %>%
# Reshape to facet on model
pivot_longer(cols=c(-real.CDR), names_to="Model", values_to="Prediction") %>%
mutate(Model=factor(Model, levels=c("ensemble.glmnet", "ensemble.glm", "ensemble.rf", "elastic.net", "knn", "neural.net", "random.forest"))) %>%
ggplot(data=., mapping=aes(x=real.CDR, y= Prediction, color=Model)) +
geom_point(alpha=0.3) +
facet_grid(.~Model, scales="free") +
ggtitle("Model Predictions for CDR Sum of Boxes Annual Change in Training Data") +
theme_minimal() +
theme(legend.position="none") +
ylab("Predicted CDR-SoB Change") +
xlab("Actual CDR-SoB Change") +
theme(plot.title=element_text(hjust=0.5))
# Create test data ggplot to be converted to interactive plotly plot
p.test <- test.df %>%
# Reshape to facet on model
pivot_longer(cols=c(-real.CDR), names_to="Model", values_to="Prediction") %>%
mutate(Model=factor(Model, levels=c("ensemble.glmnet", "ensemble.glm", "ensemble.rf", "elastic.net", "knn", "neural.net", "random.forest"))) %>%
ggplot(data=., mapping=aes(x=real.CDR, y= Prediction, color=Model)) +
geom_point(alpha=0.3) +
facet_grid(.~Model, scales="free") +
ggtitle("Model Predictions for CDR Sum of Boxes Annual Change in Test Data") +
theme_minimal() +
theme(legend.position="none") +
ylab("Predicted CDR-SoB Change") +
xlab("Actual CDR-SoB Change") +
theme(plot.title=element_text(hjust=0.5))
# Use ggplotly to create interactive HTML plots
ggplotly(p.train, height=350, width=900)
ggplotly(p.test, height=350, width=900)
The predicted CDR-SoB change values from the individual random forest model were very similar to the actual observed values in the training data, yielding an \(R^2\) of 0.89. However, comparing this with the \(R^2\) of 0.05 in the test dataset suggests that the model is overfit to the training data and does not generalize beyond it. The same can be said of the other models, all of which exhibited substantially lower \(R^2\) values in the test data than in the training data.
# Compile RMSE and R2 results comparing real vs. predicted values for ensembles and component models
overall.ensemble.results <- do.call(plyr::rbind.fill, list(rmse.train, rmse.test, r2.train, r2.test)) %>%
# Reshape to facet on metric -- i.e. RMSE or R2
pivot_longer(cols=c(-Metric), names_to="Model", values_to="Value") %>%
separate(Metric, into=c("Data", "Metric"), sep="_")
p.ensemble.r2.rmse <- overall.ensemble.results %>%
mutate(Metric = ifelse(Metric=="RMSE", "Real vs. Predicted CDR-SoB RMSE", "Real vs. Predicted CDR-SoB R2")) %>%
mutate(Data=factor(Data, levels=c("Train", "Test")),
Model=factor(Model, levels=c("ensemble.glmnet", "ensemble.glm", "ensemble.rf", "elastic.net", "knn", "neural.net", "random.forest"))) %>%
ggplot(data=., mapping=aes(x=Data, y=Value, color=Model, group=Model)) +
geom_point() +
geom_line() +
theme_minimal() +
facet_wrap(Metric~., scales="free", nrow=1) +
theme(strip.text=element_text(size=12, face="bold"),
axis.title=element_blank())
# Convert to interactive plotly plot, rename x/y axis titles
ggplotly(p.ensemble.r2.rmse) %>%
layout(yaxis = list(title = "R2",
titlefont = list(size = 12)),
xaxis = list(title = "Data Subset",
titlefont = list(size = 12)),
yaxis2 = list(title = "RMSE",
titlefont = list(size = 12)),
xaxis2 = list(title = "Data Subset",
titlefont = list(size = 12)),
autosize = F, width = 900, height = 400)